
Application of Large Language Models in Chemistry Reaction Data Extraction and Cleaning

https://doi.org/10.1145/3627673.3679874

Huang, Xiaobao; Surve, Mihir; Liu, Yuhan; Luo, Tengfei; Wiest, Olaf; Zhang, Xiangliang; Chawla, Nitesh V (October 2024, ACM)

Chemical reaction data has existed, and still largely exists, in unstructured forms. Curating such information into datasets suitable for tasks such as yield and reaction outcome prediction is impractical to do manually and cannot be automated through programmatic means alone. Large language models (LLMs) have emerged as potent tools with remarkable capabilities in processing textual information, and could therefore be extremely useful in automating this process. To address the challenge of unstructured data, we manually curated a dataset of structured chemical reaction data to fine-tune and evaluate LLMs. We propose a paradigm that leverages prompt tuning, fine-tuning, and a verifier that checks the extracted information. We evaluate the capabilities of various LLMs, including LLAMA-2 and GPT models with different parameter counts, on the data extraction task. Our results show that prompt tuning of GPT-4 yields the best accuracy and overall evaluation results. Fine-tuning LLAMA-2 models with hundreds of samples, however, does enable them to organize scientific material according to user-defined schemas more effectively. This workflow demonstrates an adaptable approach to chemical reaction data extraction, but it also highlights the challenges associated with nuance in chemical information. Our code is open-sourced on GitHub.
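
The abstract mentions a verifier that checks extracted information against a user-defined schema but does not describe how it works. As a rough illustration only, the following Python sketch shows one way such a check could look, assuming the LLM is asked to return a JSON object. The schema fields, the function name `verify_record`, and the 0-100 yield rule are all hypothetical and are not taken from the paper or its released code.

```python
import json

# Hypothetical user-defined schema: field name -> expected Python type.
# These fields are illustrative assumptions, not the paper's actual schema.
REACTION_SCHEMA = {
    "reactants": list,
    "products": list,
    "solvent": str,
    "temperature_c": float,
    "yield_percent": float,
}


def verify_record(raw_output: str, schema: dict = REACTION_SCHEMA) -> list[str]:
    """Return a list of problems found in an LLM's JSON output (empty if none)."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # JSON does not distinguish 85 from 85.0, so accept ints for float fields
        # (but not booleans, which Python treats as a subclass of int).
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")

    # Flag fields the schema does not define.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")

    # Assumed domain rule: a reported yield must lie between 0 and 100 percent.
    y = record.get("yield_percent")
    if isinstance(y, (int, float)) and not isinstance(y, bool) and not 0 <= y <= 100:
        problems.append("yield_percent outside 0-100")
    return problems
```

In a pipeline of this kind, a non-empty problem list could be fed back to the model as a correction prompt or used to route the record to manual review. Which of these the authors' verifier does, if either, is not stated in the abstract.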

Full Text Available
